Random Forests for Big Data

Authors

  • Robin Genuer
  • Jean-Michel Poggi
  • Christine Tuleau-Malot
  • Nathalie Villa-Vialaneix
Abstract

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper reviews available proposals for random forests in parallel environments as well as for online random forests. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with three variants, involving subsampling, Big Data-bootstrap and MapReduce respectively, on two massive datasets (15 and 120 million observations): a simulated one and a real-world one.
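
To make the MapReduce-style variant mentioned in the abstract more concrete, here is a minimal sketch in Python. It is not the authors' implementation: one-split trees (stumps) stand in for full decision trees, the data are a toy simulation, and every name (`fit_stump`, `map_chunk`, `reduce_forests`, `make_data`) is hypothetical. The idea shown is only the general pattern: split the data into chunks, grow a small bootstrap-based forest on each chunk (map), then merge the sub-forests and aggregate by majority vote (reduce).

```python
import random
from collections import Counter


def fit_stump(rows, rng):
    """Fit a one-split tree on a bootstrap sample, using one random feature."""
    sample = [rng.choice(rows) for _ in rows]      # bootstrap: draw with replacement
    feature = rng.randrange(len(sample[0][0]))     # random feature selection
    best = None
    for x, _ in sample:                            # try each observed value as threshold
        threshold = x[feature]
        left = [y for xs, y in sample if xs[feature] <= threshold]
        right = [y for xs, y in sample if xs[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]                                # (feature, threshold, left, right)


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def map_chunk(chunk, n_trees, rng):
    """Map step: grow a small forest using only one chunk of the data."""
    return [fit_stump(chunk, rng) for _ in range(n_trees)]


def reduce_forests(sub_forests):
    """Reduce step: merge the per-chunk forests into a single forest."""
    return [tree for forest in sub_forests for tree in forest]


def predict_forest(forest, x):
    """Aggregate the trees by majority vote."""
    votes = Counter(predict_stump(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]


def make_data(n, rng):
    """Toy two-class data (an assumption for illustration only)."""
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        rows.append((x, int(x[0] + x[1] > 1.0)))
    return rows


rng = random.Random(0)
train = make_data(400, rng)
test = make_data(200, rng)

chunks = [train[i::3] for i in range(3)]           # three disjoint chunks
forest = reduce_forests([map_chunk(chunk, 7, rng) for chunk in chunks])
accuracy = sum(predict_forest(forest, x) == y for x, y in test) / len(test)
print(f"{len(forest)} trees, test accuracy = {accuracy:.2f}")
```

In this sketch each tree sees only its own chunk, so no single worker needs the full dataset in memory. The subsampling variant would correspond, roughly, to growing the whole forest on one randomly drawn chunk instead of merging several.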

Similar resources

Exploratory Data Analysis using Random Forests

Although the rise of "big data" has made machine learning algorithms more visible and relevant for social scientists, they are still widely considered to be "black box" models that are not well suited for substantive research: only prediction. We argue that this need not be the case, and present one method, Random Forests, with an emphasis on its practical application for exploratory analysis a...

Energy Efficient Data Mining Scheme for Big Data Biodiversity Environment

In this paper, we propose a novel energy efficient data mining scheme for big data biodiversity environment. Efficient machine learning and data mining techniques provide an unprecedented opportunity to monitor and characterize big data biodiversity environments, such as forest cover type, monitored using low cost wireless sensor networks. However, given the sheer amount of data collected by th...

Big data for microstructure-property relationships: a case study of predicting effective conductivities

The analysis of big data is changing industries, businesses and research since large amounts of data are available nowadays. In the area of microstructures, acquisition of (3D tomographic image) data is difficult and time-consuming. It is shown that large amounts of data representing the geometry of virtual, but realistic 3D microstructures can be generated using stochastic microstructure model...

Implementation of Random Forest Algorithm in Order to Use Big Data to Improve Real-Time Traffic Monitoring and Safety

Nowadays the active traffic management is enabled for better performance due to the nature of the real-time large data in transportation system. With the advancement of large data, monitoring and improving the traffic safety transformed into necessity in the form of actively and appropriately. Performance efficiency and traffic safety are considered as an important element in measuring the pe...

Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections

BACKGROUND Big data is steadily growing in epidemiology. We explored the performances of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome. METHODS We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected u...

Journal title:
  • Big Data Research

Volume 9  Issue 

Pages  -

Publication date 2017